Analyzing Schema.org
Abstract
Schema.org is a way to add machine-understandable information to web pages that is processed by the major search engines to improve search performance. The definition of schema.org is provided as a set of web pages plus a partial mapping into RDF triples with unusual properties, and is incomplete in a number of places. This analysis of and formal semantics for schema.org provides a complete basis for a plausible version of what schema.org should be.

Schema.org¹ “provides a collection of schemas, i.e., html tags, that webmasters can use to [mark up] their pages in ways recognized by major search providers.”² The major search engine providers, including Bing, Google, Yahoo!, and Yandex, use schema.org markup to improve the display of search results, and schema.org has been designed by and is controlled by these organizations. This makes schema.org markup an important kind of machine-understandable data on the web. Not only are there many web pages with schema.org information, but this information is used in important ways.

Aside from being a collection of schemas, schema.org is a language for representing information on the Web, different from other languages used for this purpose, such as RDF [1, 2], OWL [3, 4], and the language underlying Freebase [5]. Using this language, the schema.org schemas are organized into a simple taxonomy by generalization relationships, and other ontological aspects of schema.org information are specified.

The publicly available definition of schema.org is, however, incomplete and contradictory. It is only provided as English text on various web pages in schema.org, plus mappings of the collection of schemas³ into RDF (http://schema.org/docs/full_md.html) and OWL (http://schema.org/docs/schemaorg.owl). The RDF mapping centrally uses non-RDFS properties, such as http://schema.org/domainIncludes, so it is not possible to determine the meaning of schema.org constructs from the RDF mapping. The OWL mapping is somewhat better, as domains and ranges employ OWL unions, but the mapping is only a translation of part of what defines schema.org. The lack of a complete definition of schema.org limits the possibility of extracting the correct information from web pages that have schema.org markup.

This paper provides a full basis for schema.org as it should be, filling in the holes in the available descriptions of schema.org and fixing up discrepancies. The paper provides both a pre-theoretic analysis of schema.org and an abstract syntax and formal model-theoretic semantics for schema.org. This paper does not, however, draw on the use of schema.org on web pages.

Researchers can use the basis provided here to further investigate the properties of schema.org and schema.org markup. Providers of schema.org data can use this basis to reliably determine the meaning of the schema.org data they create. Developers can use this basis to build software that uses schema.org markup as information in a way that is compatible with the description of schema.org.

¹ Throughout this paper schema.org refers to the general idea and schema.org refers to the collection of documents available at the https://schema.org web site.
² From https://schema.org, as of 1 April 2014.
³ See http://schema.org/docs/datamodel.html.

Description of Schema.org at schema.org

The description of schema.org in this section of the paper is taken from information on the web pages in schema.org, as of 1 April 2014. It ignores most of the surface syntax aspects of schema.org, concentrating on the underlying concepts and their intent.

Schema.org information is about items, e.g., the movie Avatar. Items can have types, e.g., the type identified by the URL http://schema.org/Movie. Items can have associated property-value pairs, e.g., the property identified by http://schema.org/director with value "James Cameron".
The value in a property-value pair can be text, i.e., a Unicode string; a literal, e.g., a number or date; a URL, which identifies an item; or another item. There is no requirement that properties have only a single value for an item. Items can have associated URLs, e.g., http://www.avatarmovie.com/index.html and http://en.wikipedia.org/wiki/Avatar_(2009_film), each of which identifies the item.

Schema.org provides a collection of types, via pages in schema.org, organized in a multi-parent generalization hierarchy. Each type is identified by the URL of the page that provides its definition. Each type has a set of parents, i.e., more-general types. Each type, except for datatypes, has a set of allowable properties for the type. The types that are more specific than http://schema.org/Enumeration are enumeration types that also specify a set of URLs identifying all the items that are instances of the type. Datatypes are the types more specific than http://schema.org/Datatype; each implicitly provides a set of non-item data values and a mapping from text to these values.

Schema.org also provides a collection of properties, again from schema.org, which may also be organized in a multi-parent generalization hierarchy.⁴ Each property is identified by the URL of the page that provides its definition. Each property may have one or more types as domains, and can be used on items belonging to any of these types. Each property has one or more types as ranges, and values for the property belong to one or more of these types. However, property values can always be provided as just text.

There is a description of an extension mechanism for schema.org, which only permits very simple extensions. It appears that the extension mechanism exists only to further subdivide existing schema.org properties, classes, and enumeration items and that these extensions are ignored within schema.org.
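As a rough illustration of the description above, the following Python sketch models items with identifying URLs, types, and multi-valued properties, together with a multi-parent type hierarchy. The hierarchy fragment, the Item class, and the helper names are assumptions made for illustration only; they are not part of schema.org or of the formal account in this paper.

```python
from dataclasses import dataclass, field

S = "http://schema.org/"

# Hand-written, partial fragment of the type hierarchy (an assumption for
# this sketch): type URL -> set of parent type URLs.
PARENTS = {
    S + "Thing": set(),
    S + "CreativeWork": {S + "Thing"},
    S + "Movie": {S + "CreativeWork"},
    S + "Organization": {S + "Thing"},
    S + "Place": {S + "Thing"},
    S + "LocalBusiness": {S + "Organization", S + "Place"},  # two parents
}


def generalizations(type_url):
    """Return the type itself plus every type reachable through parent links."""
    seen = set()
    stack = [type_url]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen


@dataclass
class Item:
    """Hypothetical in-memory form of an item."""
    urls: set = field(default_factory=set)      # URLs identifying the item
    types: set = field(default_factory=set)     # directly stated types
    values: dict = field(default_factory=dict)  # property URL -> list of values

    def add(self, prop, value):
        # A property may have any number of values for an item.
        self.values.setdefault(prop, []).append(value)

    def all_types(self):
        # An item belonging to a type also belongs to that type's parents.
        result = set()
        for type_url in self.types:
            result |= generalizations(type_url)
        return result


avatar = Item(urls={"http://www.avatarmovie.com/index.html"},
              types={S + "Movie"})
avatar.add(S + "director", "James Cameron")
```

Storing values as lists reflects the absence of any single-value requirement, and computing all_types on demand keeps the stated types separate from those implied by the hierarchy.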
The translations of the type and property definitions of schema.org into RDF and OWL abide by the above description, except that there is no translation for the property hierarchy. These translations provide no extra information beyond what is given here.

⁴ At the time of writing of this paper there was no general notation of the property hierarchy. While this paper was in review, the property hierarchy was officially announced (http://lists.w3.org/Archives/Public/public-vocabs/2014Jun/0095.html).

Analysis of schema.org as a Description of Schema.org

There are quite a number of aspects of schema.org and schema.org markup that are left unspecified in schema.org, are unclear, or raise issues. This section describes these aspects and provides extra assumptions that will be used in the account for schema.org presented here. The extra assumptions have been made in a way that is congruent with the information on schema.org, that makes sense in an environment where there are large central consumers of large amounts of data, and that generates a reasonable representation formalism. (In several places, the comments in schema.org do not match the actual class or property; for example, instances of http://schema.org/StructuredValue are not strings. This sort of mismatch is not the subject of this paper.)

It is unclear whether types and properties can also be items. However, items work quite differently from types and properties, and allowing arbitrary web pages to modify the types and properties of schema.org leads to difficulties, such as not being able to determine whether a property is valid for an item until after all item information has been processed. This account therefore treats types and properties as being different from items. In particular, in this account different URLs that identify the same item do not identify the same type or the same property.
Data values also act differently from items, so this account treats them as being disjoint from types, properties, and items. The identifiers of types and properties are different in schema.org, as URLs for types have initial capitals and URLs for properties do not, so it is fairly obvious that types are disjoint from properties.

Schema.org uses URLs as identifiers. URLs can be used to retrieve web pages, and this aspect of URLs is a main basis of schema.org. URLs officially can include fragment ids, and such URLs then identify parts of web pages. Although fragment identifiers are not currently used for any types and properties in schema.org, there is nothing technical preventing their use, and so they will be allowed in the account herein for types, properties, and items.

It is unclear whether schema.org types and properties must be identified by URLs in schema.org, but all current schema.org types and properties are so identified. This account does not formally make the assumption that types and properties must be identified by URLs in schema.org, but some of the pragmatic analysis does make the assumption that type and property definitions change infrequently, as is the case for types and properties identified by URLs in schema.org.

The mechanisms for working with datatypes are underspecified in schema.org. This account adds a formal mechanism for determining the set of values for a datatype and a formal method for determining the data value corresponding to a text string for the datatype.

The name of the most general datatype in schema.org is http://schema.org/Datatype. This is an unfortunate name (http://schema.org/Literal would be much better), but the schema.org name will be used in this account.

The name of the datatype for floating point numbers in schema.org is http://schema.org/Float. http://schema.org/Float and http://schema.org/Integer both have generalization http://schema.org/Number.
This can lead to problems because floating point numbers are imprecise whereas integers are precise. This account, however, does not address the issue.

It is unclear whether the instances of an enumeration have to be items, or can also be data values. This account assumes that the instances of an enumeration are given as URLs, as is the case for all examples currently in schema.org, and thus that instances of an enumeration are items, not data values.

Some examples in schema.org only make sense if different but similar URLs identify different items. This is particularly the case for URLs that make up enumerations. This account assumes that different URLs in an enumeration identify different items, but does not otherwise assume that different URLs in the same namespace (e.g., different Wikipedia URLs) or in the same document identify different items. That stronger assumption would be easy to add.

The domains of a property are specified both as part of types and as part of properties in schema.org. In all the examples there is no divergence between the two specifications, but the possibility of divergence is not ruled out. This account treats the specification in the type as the actual specification, as that seems to make more sense for disjunctive domains.

Because several properties indicate that they are subproperties of other properties, this account incorporates a multi-parent property hierarchy. There are some additions to the account herein that have to be made to support the property hierarchy.

Both domains and ranges of properties are disjunctive. This is different from most other representation formalisms, such as description logics [6] and RDF [2]. The stated rationale for this decision is that it reduces the need for general types that exist only to be domains or ranges. However, disjunctive domains and ranges mean that additions to a collection of schema.org information can be non-monotonic.
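One possible reading of this non-monotonicity can be sketched in Python, assuming a consumer that concludes an item's type from a property's domain only when that domain is unique. The function name and the domain assignments below are hypothetical and do not reflect the actual schema.org definitions.

```python
def inferred_types(used_properties, domains):
    """Types an item must have, judged only from the properties used on it.

    With disjunctive domains, a property forces a type only when it has
    exactly one domain; with several domains no single type follows.
    """
    forced = set()
    for prop in used_properties:
        property_domains = domains.get(prop, set())
        if len(property_domains) == 1:
            forced |= property_domains
    return forced


# Hypothetical schema: s:director has a single domain.
domains = {"s:director": {"s:Movie"}}
before = inferred_types({"s:director"}, domains)

# Adding information (a second domain) retracts the earlier conclusion.
domains["s:director"].add("s:TVSeries")
after = inferred_types({"s:director"}, domains)
```

Here `before` contains s:Movie while `after` is empty: enlarging the schema removed a conclusion that previously held, which is the hallmark of non-monotonic behaviour.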
The disjunctive nature of domains and ranges is fully explored in this account, including how it interacts with the property hierarchy.

Several aspects of the predominant syntaxes for schema.org markup obscure the workings of schema.org. This account transforms these aspects of surface syntax into a different abstract syntax.

Several types and properties are used as part of the foundations of schema.org in schema.org. Nearly all uses of these types and properties as general types or properties undermine the foundations of schema.org, so their use is disallowed in this account.

The extension mechanism for schema.org is of very limited utility and appears not to have any effect on the processing of schema.org markup, so it is ignored in this account.

Description of Schema.org as It Should Be

This section contains a pre-theoretic description of schema.org and schema.org content as it should be, consonant with the discussion in the previous section. This description is designed to say how schema.org could work in a way that can be easily turned into a formal definition of schema.org, as is done in the following section of this paper.

Throughout this account, a URL is a uniform resource locator, optionally including a fragment part. The document (fragment) at that URL is (the appropriate fragment of) the document obtained by the usual web mechanisms for retrieving a document given a URL. URLs will generally be written as CURIEs [7], with the prefix s expanding to http://schema.org/, the prefix w expanding to http://en.wikipedia.org/wiki/, and the prefixes rdf, rdfs, and owl expanding to their usual expansions.

The constituents of schema.org information are types, properties, data values, and items.

There is a collection of types, in a multi-parent generalization taxonomy, with two roots, s:Thing and s:Datatype. Each type is identified by a unique URL. The document (fragment) at that URL defines the type, listing:
1. some types that are more general than it (its parents), and
2. for non-datatypes, its properties (see below).
Parents and properties, and information about instances where appropriate, are the only information about a type obtainable from its defining document (fragment). Each type has as a generalization (not necessarily directly specified in its defining document) either s:Thing or s:Datatype, but not both.

The types with strict generalization s:Datatype are datatypes. All the data values belonging to the datatype are described in the datatype’s defining document (fragment), as is a way of transforming text strings into these data values. The datatypes are s:Boolean, s:Date, s:DateTime, s:Number, s:Float, s:Integer, s:Text (Unicode strings), s:URL, and s:Time. The details of these datatypes do not matter for this account, except for s:Text, and are not described here.

The type s:Enumeration has s:Thing as a parent.⁵ Those types with strict generalization s:Enumeration are enumeration types. All those items with the enumeration type as a direct type are listed in the type’s defining document (fragment). Different URLs identify different items in an enumeration.

The type s:Thing has properties s:description and s:name.⁶

There is a collection of properties, disjoint from types, in a multiple-parent generalization taxonomy with multiple roots. Each property is identified by a unique URL. The document (fragment) at that URL defines the property, providing:
1. types that its values belong to (its ranges), and
2. some properties that are more general than it (its parents).
Ranges and parents are the only information about a property obtainable from its defining document (fragment). For each parent of the property and for each range of the property, the parent must have a range that is the same as or a generalization of that range.
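The condition relating a property's ranges to those of its parents can be checked mechanically. The sketch below is one way to do so in Python, assuming the type and property hierarchies are available as dictionaries; the ex: properties and all function names are hypothetical and introduced only for illustration.

```python
# Partial, hand-written datatype hierarchy (an assumption for this sketch).
TYPE_PARENTS = {
    "s:Datatype": set(),
    "s:Text": {"s:Datatype"},
    "s:Number": {"s:Datatype"},
    "s:Integer": {"s:Number"},
}


def generalizations(type_name):
    """Return type_name together with all of its transitive generalizations."""
    seen, stack = set(), [type_name]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(TYPE_PARENTS.get(current, ()))
    return seen


def ranges_consistent(prop, prop_parents, ranges):
    """Check that each range of prop is covered by a range of each parent.

    A parent range covers a range when it is the same type or one of its
    generalizations.
    """
    for parent in prop_parents.get(prop, ()):
        for range_type in ranges[prop]:
            if not (generalizations(range_type) & ranges[parent]):
                return False
    return True


# Hypothetical properties: ex:count and ex:label both name ex:amount as parent.
PROP_PARENTS = {"ex:count": {"ex:amount"}, "ex:label": {"ex:amount"}}
RANGES = {
    "ex:amount": {"s:Number"},
    "ex:count": {"s:Integer"},  # s:Integer is generalized by s:Number: allowed
    "ex:label": {"s:Text"},     # s:Text is not generalized by s:Number: rejected
}
```

Only direct parents are inspected; if the condition holds for every property, it also holds along longer chains of parents.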
This condition on property ranges means that the validity of a property value can be checked by looking only at the range types of the property itself. The properties s:description and s:name both have range s:Text.

⁵ Enumeration actually has a different supertype on schema.org but this account removes the unneeded supertype.
⁶ There are several other properties for s:Thing on schema.org, but these do not play a role in this account and are ignored here.

Data values belong to one or more datatypes, and are disjoint from types and properties. Data values are written as a combination of a URL identifying a datatype and a text string. The mapping in the datatype turns the text string into a value of the datatype. Every data value belongs to s:Datatype. If a data value belongs to a datatype then it belongs to the parents of the datatype.

Items are things in the world, including information things, and are disjoint from types, properties, and data values. Items belong to (one or more) non-datatype types. Items have zero or more URLs identifying them. Items are associated with (other) items and data values via properties. Every item belongs to s:Thing. If an item belongs to a type then it belongs to the parents of the type. If an item or data value is associated with an item via a property then the item or data value is also associated with the item via each parent of the property. For each item or data value associated with an item via a property,
1. one of the item’s types has the property as one of its properties, and
2. the item or data value belongs to one of the ranges of the property.

The documents (document fragments) at the URLs identifying an item provide information about the item, including types for the item as well as items and data values associated with the item via properties.

Bare text can be used as if it were the value for any property.
If the property does not have s:Text or s:Datatype as one of its ranges, but does have as ranges one or more datatypes with a data value that can be written as the bare text, then the actual value for the property is one of these data values. If the property does not have s:Text or s:Datatype as one of its ranges, and does not have any suitable datatypes as a range, but does have one or more non-datatypes as a range, then the actual value for the property is some item that has a type that is one of these ranges, and this item has the text as a value of its s:description property. (The property s:description is used instead of s:name, as the text might not truly be a name for the value.) Otherwise the actual value for the property is the bare text itself.

Any surface syntax must provide ways to write all possible data values (as long as they are not too big). Any surface syntax must have ways to provide items with any number of types, including none, and values for any property of any of the provided types or their generalizations or s:Thing, including allowing multiple values for a property. Any surface syntax must provide ways for writing items with no identifying URLs. Any surface syntax must specially process syntax that would otherwise produce values for s:additionalType, turning the values into types; and for s:url and s:sameAs, turning the values into identifying URLs.

The following URLs are not used to identify types or properties: s:Class, s:Property, s:domainIncludes, s:rangeIncludes, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range, rdf:type, rdfs:Class, rdf:Property, and owl:Class. If they are used in a surface syntax to provide information about an item, they and their values must be ignored. The following URLs are not used to identify properties: s:additionalType, s:url, and s:sameAs.
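The bare-text rules in this section amount to a small decision procedure. The following Python sketch shows one way it might be written, assuming each datatype supplies a parser that raises ValueError for text it cannot read. The parsers, the partial datatype list, the result tuples, and the function name are all assumptions for illustration rather than part of the account.

```python
from datetime import date

# Hypothetical stand-ins for the text-to-value mappings of two datatypes.
PARSERS = {
    "s:Integer": int,
    "s:Date": date.fromisoformat,
}

# Partial list of datatypes, enough for this sketch.
DATATYPES = {"s:Datatype", "s:Text", "s:Integer", "s:Date"}


def resolve_bare_text(text, ranges):
    """Decide what a bare text value stands for, given a property's ranges."""
    # Text is acceptable as it stands when s:Text or s:Datatype is a range.
    if "s:Text" in ranges or "s:Datatype" in ranges:
        return ("text", text)
    # Otherwise try the datatype ranges; the first one that can read the
    # text supplies the data value.
    for range_type in sorted(ranges):
        parser = PARSERS.get(range_type)
        if parser is None:
            continue
        try:
            return ("data", range_type, parser(text))
        except ValueError:
            continue
    # Otherwise the value is some item of one of the non-datatype ranges,
    # carrying the text as its s:description.
    item_ranges = sorted(r for r in ranges if r not in DATATYPES)
    if item_ranges:
        return ("item", item_ranges, {"s:description": text})
    # Otherwise the value is the bare text itself.
    return ("text", text)
```

For instance, under these assumptions `resolve_bare_text("James Cameron", {"s:Person"})` yields an anonymous item description, while `resolve_bare_text("2009-12-18", {"s:Date"})` yields a date value. Returning the list of candidate types for an item reflects that the account only says the item has some type among the ranges.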
Formal Definition for Schema.org

This definition for schema.org defines an abstract syntax for schema.org, abstracting away from the details of the various surface syntaxes, and a model-theoretic semantics that provides a formal meaning for schema.org. It conforms to the pre-theoretic description above.
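As an informal illustration of the pre-theoretic description that the formal definition conforms to, the two conditions on a value associated with an item via a property might be checked as follows. This is only a sketch over a hand-written schema fragment; the dictionaries and function names are assumptions and are not the abstract syntax or semantics of the paper.

```python
# Hand-written, partial schema fragment (an assumption for this sketch).
TYPE_PARENTS = {
    "s:Thing": set(),
    "s:CreativeWork": {"s:Thing"},
    "s:Movie": {"s:CreativeWork"},
    "s:Person": {"s:Thing"},
    "s:Datatype": set(),
    "s:Text": {"s:Datatype"},
}
TYPE_PROPERTIES = {
    "s:Thing": {"s:name", "s:description"},
    "s:Movie": {"s:director"},
}
RANGES = {
    "s:name": {"s:Text"},
    "s:description": {"s:Text"},
    "s:director": {"s:Person"},
}


def closure(types):
    """Return the given types plus all of their transitive generalizations."""
    seen, stack = set(), list(types)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(TYPE_PARENTS.get(current, ()))
    return seen


def value_allowed(item_types, prop, value_types):
    """Check both conditions for a value associated with an item via prop."""
    # 1. One of the item's types (including generalizations) has the property.
    has_property = any(prop in TYPE_PROPERTIES.get(t, ())
                       for t in closure(item_types))
    # 2. The value belongs to one of the ranges of the property.
    in_range = bool(closure(value_types) & RANGES.get(prop, set()))
    return has_property and in_range
```

With this fragment, a movie may have an s:director that is a person and an s:name that is text, whereas s:director on an item that is only a person fails the first condition.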
Similar resources
Domain Specific Semantic Validation of Schema.org Annotations
Since its unveiling in 2011, schema.org has become the de facto standard for publishing semantically described structured data on the web, typically in the form of web page annotations. The increasing adoption of schema.org facilitates the growth of the web of data, as well as the development of automated agents that operate on this data. Schema.org is a large heterogeneous vocabulary that cove...
Exposing Library Holdings Metadata in RDF Using Schema.org Semantics
Libraries have been busy transforming and publishing their data as linked open data by testing already existing semantics and developing new sets of semantics. So far, most of the efforts have focused on the bibliographic data, not the holdings and item related data that are unique to individual libraries and that help users access the information resources they need. The University of Illinois...
Analysis of Schema.org Usage in the Tourism Domain
Schema.org is an initiative founded in 2011 by the four big search engines Bing, Google, Yahoo!, and Yandex. The goal of the initiative is to publish and maintain the schema.org vocabulary, in order to facilitate the publication of structured data on the web which can enable the implementation of automated agents like intelligent personal assistants and chatbots. In this paper, the usage of sche...
Schema.org as a Description Logic
Schema.org is an initiative by the major search engine providers Bing, Google, Yahoo!, and Yandex that provides a collection of ontologies which webmasters can use to mark up their pages. Schema.org comes without a formal language definition and without a clear semantics. We formalize the language of Schema.org as a Description Logic (DL) and study the complexity of querying data using (unions ...
A Computer-Guided Approach to Website Schema.org Design
Schema.org offers to web developers the opportunity to enrich a website’s content with microdata and schema.org. For large websites, implementing microdata can take a lot of time. In general, it is necessary to perform two main activities, for which we lack methods and tools. The first consists in designing what we call the website schema.org, which is the fragment of schema.org that is relevan...
Publication date: 2014